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ABSTRACT 

Motivation: Next-generation sequencing technologies sequence 
viruses with ultra-deep coverage, thus promising to revolutionize 
our understanding of the underlying diversity of viral populations. 
While the sequencing coverage is high enough that even rare viral 
variants are sequenced, the presence of sequencing errors makes it 
difficult to distinguish between rare variants and sequencing errors. 
Results: In this article, we present a method to overcome the limi- 
tations of sequencing technologies and assemble a diverse viral 
population that allows for the detection of previously undiscovered 
rare variants. The proposed method consists of a high-fidelity 
sequencing protocol and an accurate viral population assembly 
method, referred to as Viral Genome Assembler (VGA). The pro- 
posed protocol is able to eliminate sequencing errors by using in- 
dividual barcodes attached to the sequencing fragments. Highly 
accurate data in combination with deep coverage allow VGA to 
assemble rare variants. VGA uses an expectation-maximization al- 
gorithm to estimate abundances of the assembled viral variants in 
the population. Results on both synthetic and real datasets show 
that our method is able to accurately assemble an HIV viral popu- 
lation and detect rare variants previously undetectable due to 
sequencing errors. VGA outperforms state-of-the-art methods for 
genome-wide viral assembly. Furthermore, our method is the first 
viral assembly method that scales to millions of sequencing reads. 
Availability: Our tool VGA is freely available at http://genetics.cs. 
ucla.edu/vga/ 

Contact: serghei@cs.ucla.edu; eeskin@cs.ucla.edu 
1 INTRODUCTION 

Human immunodeficiency virus (HIV) exhibits high genomic 
diversity within an infected host, which affects many clinically 
important phenotypic traits such as escape from vaccine-induced 
immunity, virulence and response to antiviral therapies (Lauring 
and Andino, 2010). To accurately characterize an intra-host HIV 
population, sequencing technologies must be sensitive enough to 
detect and quantify rare variants (Henn et aL, 2012; Tsibris et aL, 
2009). Next-generation sequencing (NGS) technologies offer 
deep coverage of genomic data in the form of millions of sequen- 
cing reads (Metzker, 2009). While the sequencing coverage is 
high enough to capture rare variants, the presence of sequencing 
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errors makes it difficult to distinguish between rare variants and 
sequencing errOors. Additionally, low viral population variability 
(i.e. pairs of individual viral genomes that have small genetic 
distance) and the presence of individual variants having low 
abundance complicates accessing viral diversity and assembling 
full-length viral variants. 

The full picture of viral diversity in a population remains un- 
discovered due to errors produced by sequencing platforms. 
Current sequencing technologies use different underlying chem- 
istry and offer trade-offs among throughput, read length and 
cost (Metzker, 2009). While the current sequencing platforms 
can potentially detect point-mutations, error rates may result in 
false-positive single nucleotide variant (SNV) calls or wrong 
genome variant sequences. Computational error correction tech- 
niques are able to partially correct the sequencing error and pro- 
vide an opportunity to discover highly expressed individual viral 
genomes, but low abundant variants remain undiscovered. 
Current methods (Astrovskaya, 2011; Mancuso et aL, 2011; 
Prosperi and Salemi, 2012; Zagordi et aL, 2011, 2012) are not 
able to differentiate true biological mutations from sequencing 
artifacts, thus significantly limiting the possibility of a method to 
assemble the underlying viral population. 

In this article, we propose a method to overcome these limi- 
tations by coupling a high-fidelity sequencing protocol (Kinde 
et aL, 2011) with an accurate method, referred to as Viral 
Genome Assembler (VGA), to assemble a heterogeneous viral 
population. High-fidelity sequencing protocol, known as Safe- 
SeqS, has been applied to detect rare somatic mutations, but 
its application on detecting rare viral mutations has been neg- 
lected. Similar to Safe-SeqS we apply a special library prepar- 
ation technique that eliminates sequencing errors during the de- 
multiplexing step. The proposed protocol attaches individual 
barcode sequences during the library preparation step for every 
fragment, then amplifies each tagged fragment. Reads are clus- 
tered according to the original fragment based on the attached 
barcode. An error-correction protocol is then applied for every 
read group resulting in a method that corrects errors inside the 
group and produces a corrected consensus read. Highly accurate 
data in combination with deep coverage allows for accurate es- 
timation of the underlying diversity of a viral population. 
Importantly, the low per-base sequencing cost of the Illumina 
platform makes it realistic to greatly increase coverage to detect 
ultra- rare variants. Our sequencing protocol introduces novel 
challenges for virus assembly and we develop a novel assembly 
approach for reconstructing and estimating the frequency of a 
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large number of closely related viral variants. Our method does 
not rely on the availability of a reference genome. This makes 
our method applicable to newly emerged viruses in which 
genome sequences are unknown. 

The ability to discover rare viral variants makes our tool ap- 
plicable for monitoring and quantifying an HIV population 
structure to dissect its evolutionary landscape and study genomic 
interaction. In particular, our approach allows for the discovery 
of rare mutations and variants that are of particular interest be- 
cause of their potential influence on drug resistance and treat- 
ment failure (Liu et al, 2011; Palmer et al, 2006; Wang et al, 
2007). 



2 METHODS 

2.1 Overview 

Advances in NGS and the ability to generate deep coverage data in the 
form of millions of reads provide exceptional resolution for studying the 
underlying genetic diversity of complex viral populations. However, errors 
produced by most sequencing protocols complicate distinguishing between 
true biological mutations and technical artifacts that confound detection of 
rare mutations and rare individual genome variants. A common approach 
is to use post-sequencing error correction techniques able to partially cor- 
rect the sequencing errors. In contrast to clonal samples, the post-sequen- 
cing error correction methods are not well suited for mixed viral samples 
and may lead to filtering out true biological mutations. For this reason, 
current viral assembly methods are able to detect only highly abundant 
SNV, thus limiting the discovery of rare viral genomes. 

Additional difficulty arises from the genomic architectures of viruses. 
Long common regions shared across viral population (known as conserved 
regions) introduce ambiguity in the assembly process. Conserved regions 
may be due low-diversity population or due to recombination with mul- 
tiple cross-overs. In contrast to repeats in genome assembly, conserved 
regions may be phased based on relative abundances of viral variants. 
Low-diversity viral populations in which all pairs of individual genomes 
within a viral population have a small genetic distance from each other 
may represent additional challenges for the assembly procedure. 

We apply a high-fidelity sequencing protocol to study viral population 
structure (Fig. 1). This protocol is able to eliminate errors from sequen- 
cing data by attaching individual barcodes during the library preparation 
step. After the fragments are sequenced, the barcodes identify clusters of 
reads that originated from the same fragment, thus facilitating error cor- 
rection. Given that many reads are required to sequence each fragment, 
we are trading off an increase in sequence coverage for a reduction in 
error rate. Prior to assembly, we utilize the de novo consensus reconstruc- 
tion tool, Vicuna (Yang et al, 2012), to produce a linear consensus dir- 
ectly from the sequence data. This approach offers more flexibility for 
samples that do not have 'close' reference sequences available. 
Traditional assembly methods (Gnerre et al, 2011; Luo et al., 2012; 
Zerbino and Birney, 2008) aim to reconstruct a linear consensus sequence 
and are not well-suited for assembling a large number of highly similar 
but distinct viral genomes. We instead take our ideas from haplotype 
assembly methods (Bansal and Bafna, 2008; Yang et al., 2013), which 
aim to reconstruct two closely related haplo types. However, these meth- 
ods are not applicable for assembly of a large (a priori unknown) number 
of individual genomes. Many existing viral assemblers estimate local 
population diversity and are not well suited for assembling full-length 
quasi-species variants spanning the entire viral genome. Available 
genome-wide assemblers able to reconstruct full-length quasi-species vari- 
ants are originally designed for low throughput and are impractical for 
high throughput technologies containing millions of sequencing reads. 
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Fig. 1. Overview of high-fidelity sequencing protocol. (A) DNA material 
from a viral population is cleaved into sequence fragments using any 
suitable restriction enzyme. (B) Individual barcode sequences are attached 
to the fragments. Each tagged fragment is amplified by the polymerase 
chain reaction (PCR). (C) Amplified fragments are then sequenced. (D) 
Reads are grouped according to the fragment of origin based on their 
individual barcode sequence. An error-correction protocol is applied for 
every read group, correcting the sequencing errors inside the group and 
producing corrected consensus reads. (E) Error-corrected reads are 
mapped to the population consensus. (F) SNVs are detected and 
assembled into individual viral genomes. The ordinary protocol lacks 
steps (B) and (D) 



We introduce a viral population assembly method (Fig. 2) working on 
highly accurate sequencing data able to detect rare variants and tolerate 
conserved regions shared across the population. Our method is coupled 
with post-assembly procedures able to detect and resolve ambiguity 
raised from long conserved regions using expression profiles (Fig. 2F). 
After a consensus has been reconstructed directly from the sequence data, 
our method detects SNVs from the aligned sequencing reads. Read over- 
lapping is used to link individual SNVs and distinguish between genome 
variants in the population. The viral population is condensed in a conflict 
graph built from aligned sequencing data. Two reads are originated from 
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Colors = independent sets of non-conflicting reads 

Fig. 2. Overview of VGA. (A) The algorithm takes as input paired-end reads that have been mapped to the population consensus. (B) The first step in 
the assembly is to determine pairs of conflicting reads that share different SNVs in the overlapping region. Pairs of conflicting reads are connected in the 
'conflict graph'. Each read has a node in the graph, and an edge is placed between each pair of conflicting reads. (C) The graph is colored into a minimal 
set of colors to distinguish between genome variants in the population. Colors of the graph correspond to independent sets of non-conflicting reads that 
are assembled into genome variants. In this example, the conflict graph can be minimally colored with four colors (red, green, violet and turquoise), each 
representing individual viral genomes. (D) Reads of the same color are then assembled into individual viral genomes. Only fully covered viral genomes 
are reported. (E) Reads are assigned to assembled viral genomes. Read may be shared across two or more viral genomes. VGA infers relative abundances 
of viral genomes using the expectation-maximization algorithm. (F) Long conserved regions are detected and phased based on expression profiles. In this 
example red and green viral genome share a long conserved region (colored in black). There is no direct evidence how the viral sub-genomes across the 
conserved region should be connected. In this example four possible phasing are valid. VGA use the expression information of every sub-genome to 
resolve ambiguous phasing 
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different viral genomes if they share different SNVs in the overlapping 
region. Viral variants are identified from the graph as independent sets of 
non-conflicting reads. Non-continuous coverage of rare viral variants 
may limit assembly capacities, indicating that increase in coverage is 
required to increase the assembly accuracy. Frequencies of identified vari- 
ants are then estimated using an expectation-maximization algorithm. 
Compared with existing approaches, we are able to detect rare population 
variants while achieving high assembly accuracy. 

2.2 Error correction 

The proposed sequencing is able to eliminate errors from sequencing data 
and produce highly accurate read sequences. It uses a high-fidelity 
sequencing protocol that attaches individual barcodes during the library 
preparation step. The barcodes are then used to identify reads originated 
from the same fragment, allowing to access multiple sequencing data of 
the same fragment. It follows that every sequenced position of the frag- 
ment would have multiple independent evidence, suitably promoting 
highly accurate consensus reads. By applying an error-correction proced- 
ure of the protocol, we are able to address both sequencing and PCR 
errors, which leads to high assembly accuracy. 

2.3 Consensus construction 

We build a consensus from paired-end reads using Vicuna (Yang et al., 
2012). Our sequencing method should not contain any particularly low 
coverage region allowing reconstruction of population consensus for viral 
sample. In the event that Vicuna produces multiple contigs rather than a 
complete consensus, we use BLAST to merge contigs. We require 50 nt 
overlap to merge any pair of contigs. In the next step, the population 
consensus is used as a reference genome to map reads. Building the ref- 
erence genome from actual sequencing data rather than using an anno- 
tated genome provides us with an accurate and unique mapping. 

2.4 Read mapping 

As with many viral population analyses, the first step of VGA is to map 
the reads. We map reads onto the de novo consensus using InDelFixer 
(Armin and Beerenwinkel, 2013) with default parameters. False read 
alignments are filtered out using fragment length distribution inferred 
from the mapping data. Assuming that the fragment length follows a 
normal distribution (Hormozdiari et al., 2009), we only keep reads with 
fragment length within three standard deviations from the mean. In total 
1.2% of reads have been filtered out versus expected .3% according to the 
three- sigma rule. 

2.5 Viral population assembly 

The combination of deep coverage with high accuracy provides an unpre- 
cedented opportunity for estimating genomic diversity in a viral population. 
The viral population assembly starts with determining pairs of mapped 
reads conflicting with each other in the overlapping region. Following 
Huang et al. (201 1), we construct the conflict graph G=(V,E) with vertices 
corresponding paired-end reads, i.e. V = R, and edges connecting conflict- 
ing pairs of paired-end reads. 

Obviously, any true viral genome corresponds to a maximal independ- 
ent set in the conflict graph (i.e. a maximal set of pairwise nonadjacent 
vertices), although not every maximal independent set necessarily corres- 
ponds to a true viral genome. We adopt a parsimonious approach requir- 
ing to cover the conflict graph with the minimum number of maximal 
independent sets. This problem is equivalent to MIN-GRAPH- 
COLORING which is NP-hard. There exists many heuristics for solving 
this problem [see, e.g. Johnson and Trick, 1996; Kubale, 2004] based on 
greedy selection of a maximal independent set. Unfortunately, our 



attempts to build even a single viral genome failed, as it is difficult to 
arrange paired-end reads into a connected single path. Indeed, a greedy 
algorithm runs out of any possible extension after just a few steps while 
concatenating paired-end reads from left-to-right. 

Instead, we apply an alternative 'top-down' approach of recursive 
graph partitioning along the maximum cut (Max Cut) which has been 
previously successfully applied for human haplotyping (Duitama et al., 
2012). Given a graph G=(V, E), the Max Cut problem asks for parti- 
tioning of the vertices into two components V=V\ U V2 maximizing the 
total number of edges which have one endpoint in V\ and the other in V 2 . 
The Max Cut problem though NP-hard is well approximated by a simple 
0.5-approximation algorithm that randomly assigns vertex to one of the 
two components (Mitzenmacher and Upfal, 2005). Our Max Cut heur- 
istic starts with alternatively assigning left-to-right sorted mapped reads 
to two components and then repeatedly moves one vertex at a time from 
one component to another, improving the solution at each step, until no 
more improvements of this type can be made. 

Our coloring heuristic recursively partitions the conflict graph until 
each component becomes independent. If reads of a given color com- 
pletely cover the consensus genome, then the resulted sequence is ac- 
cepted as the next viral genome. Otherwise, if assembled genome 
contains gaps, we add non-conflicting reads from other color classes in 
left-to-right order in attempt to fill the gaps. If all SNV positions are 
covered, then a newly reconstructed viral genome is added to the set VQ. 
Finally, the genomes whose gaps cannot be filled with the above proced- 
ure are dropped. 



Algorithm 1: VGA Assembly Algorithm. 

Input: Set of reads R aligned to the consensus genome 
Build conflict graph G = (V,E) from set R 
Recursively color G into color classes C using Max Cut 
Initialize the set of complete viral genomes VQ <— 0 
for each color class c,- e C do 

Compute maximal independent set in G = (V,E) containing c t 

Assemble reads in c,- into viral genome g t 

if gi covers all positions in the consensus genome then 
VQ <- VQU{ gi } 

end if 
end for 

Output: Set of complete viral genomes VQ 



2.6 Viral population quantification 

In the final step of the workflow, an expectation-maximization algorithm 
is used to infer the relative abundances of assembled viral quasi-species 
similar to what is described in Eriksson et al. (2008). We extend the 
previous EM and likelihood formulation to incorporate a prior probabil- 
ity for the viral population and compute the maximum a-posteriori esti- 
mate, rather than the MLE. 

Let H be a random variable over the set of viral variant genomes H 
= VQ and let R be a random variable over the set of reads 1Z. Let p[H] 
~Dir(a, . . . , a) be the prior probability of observing a given set of vari- 
ants and denote ph =Pr [H=h] to be the probability of observing a par- 
ticular variant h. The probability of observing read r e 1Z is given by 
marginalizing over all variants 

Pr {R = r] = ^2?r [R = r\H=h] ■ p h 

hen 

where 

f 1 iKh if r is consistent with h 
Pr [R = r\H=h] = l 

0 otherwise 
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Fig. 3. Genomic architecture of 44 real HCV viral genomes from 1739- 
bp-long fragment of E1E2 region. Length of longest common region 
shared between any two viral genomes is represented by color 



and K h is the number of reads consistent with h. We can now define the 
log-posterior as 

log Pr [H\U] = n r • log Pr [R = r] + a ■ ^ log p h - C n 

re.K hen 

where Cn is a constant and n r is the number of reads r. As this function is 
non-convex and difficult to optimize, we solve the easier problem of 
maximizing its lower-bound, 

Yl n * • lo § ( Pr [R = r\H=h]-p h ) + a-J2 lo g Ph 

ren hen hen 

where n rh is the expected number of reads r generated by variant h. The 
EM algorithm computes this by 



n rh = n, 



p h .Fr [R = r\H=h] 
Pr[R = r] 



and subsequently maximizes the log-posterior with the MAP estimate 
given by 



p h = I a + n rh ) / 1 a + n r 



3 RESULTS 

3.1 Performance of VGA on simulated data 

Because the ground truth is unknown for sequenced viral popu- 
lations, simulations present a standardized way to assess the per- 
formance of viral assembly tools. The proposed high-fidelity 
protocol allows to correct sequencing errors, thus giving access 
to highly accurate sequencing data. Post-sequencing error-cor- 
rection techniques are available for reads obtained by regular 
protocol offering the possibility to partially correct sequencing 
errors trading off for real biological mutations. Grinder (Angly 
et aL, 2012) is used to generate reads from both the high-fidelity 
and regular sequencing protocol. Reads are generated from both 
real and synthetic viral variants with different sequencing par- 
ameters and viral expression profiles. Grinder is a state-of-the-art 
sequencing read simulator able to produce shotgun sequencing 
data from a viral population with different expression profiles. 
We mapped the simulated paired-end reads onto the consensus 
using Mosaik. The consensus was constructed using Vicuna 
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Fig. 4. Accuracy of population size prediction. Up to 200 viral genomes 
were generated from the Gag/Pol 3.4 kb HIV region. The population 
diversity is 5-10%. Viral genome abundances follow power-law and uni- 
form distributions. Consensus error-corrected 1002 bp paired-end reads 
were simulated from HIV population 



(Yang et aL, 2012), a de novo assembly tool able to produce a 
linear consensus from deep paired-end sequencing data (see 
Section 2.3 for details). 

We use sensitivity and positive predictive value (PPV) to evalu- 
ate the quality of viral genomes assembled by VGA. We consider 
fully assembled viral genome without errors. Sensitivity is 
defined as the portion of assembled quasi- species that match 
true quasi-species, i.e. Sensitivity = TP / (TP + FN). Positive pre- 
dictive value is defined as the portion of true sequences among 
assembled sequences, i.e., PPV=TP/(TP + FP). Additionally, 
we evaluate ability of our method to estimate population size 
(i.e. number of viral genomes in the population). Accuracy of 
population size prediction is defined as a ratio between estimated 
and true population sizes. Finally, we use Jensen-Shannon di- 
vergence (JSD) to measure the accuracy of frequency estimation. 
Given two probability distributions, JSD measures the 'distance' 
between them, or in other words, the quality of approximation of 
one probability distribution by the other distribution. It is 
defined as the Kullback-Leibler divergence from distributions 
P and Q to their mixture. Formally, the JSD between true dis- 
tribution P and approximation distribution Q is given by the 
formula 



\v K l{P\\M)+ X -V kl {Q\\M) 



JSD(P||0 

where Kullback-Leibler divergence D^x is 



D«(P||0 = X>(Olog 



and M= \ (P+Q). The motivation for using JSD is a conse- 
quence of KL divergence being undefined when assembly meth- 
ods fail to reconstruct some variant i, hence forcing Q(i) to be 0. 
JSD averts this by measuring the distance to the mixture, which 
contains all true and called variants (TP and FP). 

Our first simulated study compares the assembly accuracy 
across different virus species. We focus on effect of read-length 
and throughput on assembly quality for different types of 
viruses. Paired-end reads of various length corresponding to 
high-fidelity and regular sequencing protocols are simulated 
from HIV and HCV populations assuming uniform and 
power-law distributions. A power-law distribution (i.e. frequency 
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Fig. 5. Assembly accuracy estimation. Up to 200 viral genomes were 
generated from the Gag/Pol 3.4 kb HIV region. The population diversity 
is 3-20%. Viral genome abundances follow power- law and uniform dis- 
tributions. Consensus error-corrected 2 100 bp paired-end reads were 
simulated from HIV population 
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Fig. 6. Assembly accuracy estimation. Up to 200 viral genomes were 
generated from the Gag/Pol 3.4 kb HIV region. The population diversity 
is 3-20%. Viral genome abundances follow power-law and uniform dis- 
tributions. Consensus error-corrected 2x 100 bp paired-end reads were 
simulated from HIV population 



of an individual viral genome is a power of the previous one) 
corresponds to a population with several dominant variants and 
many rare variants. The uniform distribution has equal frequen- 
cies for all viral genomes. HCV population is presented by 1 739- 
bp-long fragment from the E1E2 region of 44 real HCV 
sequences. HIV population consist of 10 real intra- host viral 
variants mixture from 1.3-kb-long HIV-1 region, which included 
pol protease and part of the pol reverse transcriptase (Zagordi 
et al, 2010). 

The genomic architecture across virus species was investigated 
and its influence on assembly accuracy was studied. HCV virus 
exhibits more complex genomic architecture with lower popula- 
tion diversity and longer conserved regions (Fig. 3) than HIV. 
Conserved regions were present in both viruses, although only 
HCV contains conserved regions longer then 450 bp. Conserved 
regions longer then the average fragment length (450 bp) may 
introduce ambiguity in the assembly process due to a lack of 
direct evidence of sub-genomes phasing across the conserved 
region. We performed simulated sequencing experiment where 
the average fragment amplification rate is 5, resulting in a five 
time decrease in throughput due to the consensus error correc- 
tion performed by the high-fidelity sequencing protocol. Also the 
simulation experiments were adjusted to simulate a non-uniform 
amplification rate. Non-uniform amplification rate results in dis- 
carding fragments with insufficient amplification rate (<3). From 
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Fig. 7. Assembly accuracy estimation. Consensus error-corrected paired- 
end reads of various lengths were simulated from a mixture of 10 real 
viral clones from 1.3-kb-long HIV-1 region. Assembly accuracy as mea- 
sured by PPV and sensitivity. Results are for 50000 reads, no improve- 
ment was observed when increasing the number of reads 

real studies it is known that around 10% of fragments are ampli- 
fied less than three times. Sequencing errors produced by the 
regular protocol limited the ability of VGA to accurately assem- 
ble a viral population. All assembled variants contained large 
number of mismatches, additionally VGA significantly overesti- 
mated population size. 

As expected, short read lengths dramatically inhibit reconstruc- 
tion, which is evidenced by VGA failing to produce any full- 
length genomes when given 2x36 bp reads (Fig. 4). Because 
common regions for distinct HCV viral genomes are significantly 
longer than for HIV, it is not surprising that performance of VGA 
is worse on HCV data — for 3 M 2 x 150 bp reads simulated from 
44 1739-bp-long viral genomes, sensitivity is 50%, PPV is 80%. 
Results on HCV data confirm that the lower mutation rate and 
presence of conserved regions have a negative impact on the abil- 
ity to accurately reconstruct individual viral genomes. 
Surprisingly, increasing the read length for HIV from 100 to 
150 bp yields no benefits for reconstruction accuracy suggesting 
that 100 bp read length is enough to distinguish between HIV 
viral variants with high mutation rate. Although further experi- 
ments are needed to determine optimum read length, our simu- 
lations suggest that 2 x 100 bp is recommended for small HIV 
viral populations and 2x 150 bp is recommended for medium 
HCV population with complex genomic architecture. 

We separately analyzed the ability of our method to estimate 
the viral population size (i.e. number of genomic variants present 
in the population). Non-continuous coverage limits the ability of 
the method to assemble full-length viral variants. To evaluate the 
accuracy of population size estimation, we compared the true 
population size known from simulated data with estimated re- 
sults. Continuous coverage of each individual viral genome pre- 
sent in the sample has a strong impact on quality of population 
assembly. The probability of non-continuous coverage increases 
dramatically for viral genomes with low abundance. Thus, the 
presence of coverage gaps for rare variants introduces additional 
challenges in the assembly process, making rare genomes un- 
reachable by assembly tools. The number of problematic gen- 
omes can be reduced by increasing sequencing depth; however, it 
does not guarantee complete elimination. While complete assem- 
bly of all such genomes is unrealistic, it is still possible to estimate 
the number of viral genomes present in the sample (population 
size). The number of independent sets reported by VGA provides 
us with an accurate population size estimation. Intuitively, 
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predicting the population size of a large viral population with 
many rare variants is more difficult than predicting for uniformly 
distributed or small populations (Fig. 5). The predicted popula- 
tion size may serve as an indication of insufficient coverage to 
detect the full viral diversity present in the sample. 

Deep coverage is a key for accurate estimation of underlying 
viral diversity. One such platform capable of offering millions of 
sequencing reads is Illumina HiSeq. The relatively short length of 
the produced reads is compensated for by sequencing the same 
fragment from both ends; therefore, producing coupled reads 
separated by a 'gap', known as paired-end read. To our know- 
ledge, VGA is the first method scalable to millions of short 
paired-end sequencing reads able to produce full-length viral 
variants spanning the entire viral genome. We explore the influ- 
ence of sequencing depth on the reconstruction accuracy for 
varying population structures (uniform and power-law distribu- 
tions of viral genomes within the population). HIV-1 is known to 
have greater genetic variability than any other known virus 
(Ndungu and Weiss, 2012). The diversity among viral genomes 
in an HIV population can vary from 3 to 20% depending on 
regions (Martins et ai, 1992; Yoshimura et ai, 1996). 
Heterogeneous viral samples were prepared by generating viral 
populations from the Gag/Pol 3.4 kb HIV region. We simulated 
variant abundances adhering to either a uniform or power-law 
distribution. Not surprisingly, our simulations suggest that 
increased sequencing depth has a direct positive effect on the 
discovery of rare variants and improves the overall assembly 
accuracy. Figure 6 shows the effect of coverage and population 
size on assembly for reads of length 100 bp. Throughout all ex- 
periments, VGA maintained a PPV value of 100%. 

In addition to point mutations, genetic recombination facili- 
tates rapid evolution and production of diverse HIV genomes. 
Indeed, co-infected cells may produce recombinant viral progeny 
at levels lower than mutation rates in an intra-patient environ- 
ment (Neher and Leitner, 2010). Hence, simulated datasets must 
account for both possible phenomena when determining the 
quality of assembly. We utilize a simulation model able to inte- 
grate both point mutations and recombination in the generated 
viral population depending on the amount of diversity required. 
A mixture of 10 real intra-host viral variants from 1.3-kb-long 
HIV-1 form the basis population. In addition to point mutations, 
our simulation model implicitly produces recombinant genomes 
by first constructing the genotype (i.e. sequence of SNVs) for the 
population. A random walk is performed over this genotype as 
specified number of times. Any cross-over that occurs represents 
a new recombination between the 'left' and 'right' original gen- 
omes. Recombinations are implicitly produced, and no control is 
imposed over number and length of the recombination. This 
model produces highly recombinant data on average, posing 
challenges for assembly and can be used to assess assembly qual- 
ity. Simulation model incorporate mutation into the process by 
selecting a position and nucleotide-swap uniformly at random. 
Simulation results (Fig. 7) suggest that our method can accur- 
ately assemble viral population in presence of recombinations 
and point mutations, maintaining PPV of 100%. 

Finally, we evaluate population quantification accuracy, i.e. 
the accuracy of our method in predicting abundances of the 
assembled variants. Taking the results from VGA on 10 real 
HIV clones with 50000 reads and 2x100 bp, the JSD was 



2.93e-05 for the Power-law and 0.001 for the Uniform-based 
populations. This already small measure only decreases as the 
size of the input grows. 

3.2 Performance of existing viral assemblers on simulated 
consensus error-corrected reads 

We have evaluated the performance of ShoRAH (Zagordi et ai, 
2011) and QuasiRecomb (Zagordi et aL, 2012) for simulated 
consensus error-corrected read data. 

ShoRAH disregards pairing information of reads, but it is 
scalable enough to handle up to 1 M reads. ShoRAH fails to 
produce full-length viral genome but reliably spans 98% of the 
consensus genome. It reasonably estimates the number of differ- 
ent viral genome, but even the most accurate ShoRAH- 
assembled viral genome differs from the closest true 1.3-kb- 
long viral genome in five nucleotides. 

QuasiRecomb is designed to handle paired-end read data and 
manages to produce full-length viral genomes. Unfortunately it 
can reliably process no more than 100 K reads. Also the number 
of assembled distinct viral genomes is 10-200 times more than 
the number of true distinct viral genomes. The most accurate 
QuasiRecomb-assembled viral genome still differs from the clo- 
sest true 1.3-kb-long viral genome in four nucleotides. 

Unfortunately we could not compare our method with 
QColors (Huang et ai, 2011) assembly algorithm, which uses a 
similar conflict graph to represent viral population. A CSP solver 
is used by QColors for coloring the graph which may limit its 
scalability to high-throughput datasets consisting of millions of 
sequencing reads. Currently, QColors is not publicly available 
(Upon querying for information on obtaining QColors, the au- 
thors were informed that the original software was tightly 
coupled for the analyses done in its original manuscript, and is 
not currently available for general use.). 

3.3 Performance of VGA on real HIV data 

To further test the ability of VGA to accurately assemble a di- 
verse natural occurring population and predict variant abun- 
dance levels, we used an Illumina HiSeq HIV dataset, which 
consisted of 15 M 2 x 100 bp paired-end reads with attached 
barcodes. Next, the high-fidelity sequencing protocol able to 
eliminate sequencing errors was applied resulting in 3.2 M con- 
sensus error-corrected reads (further referred to as reads). The 
reads were then used to build de novo population consensus using 
Vicuna 1.3 (Yang et ai, 2012). When run on our real data, 
Vicuna produced four contigs of average length 1195 bp. Each 
contig was then run through BLAST to check for overlaps. Once 
overlaps were found, the contigs were assembled into a final 
consensus of length 4337 bp. 

Validation of de novo consensus. A de novo assembled consensus 
was compared against reference-based consensus. To produce 
reference-based consensus, we iteratively map reads onto the 
HIV reference (Gag/Pol 3.4 kb HIV region) using InDelFixer. 
InDelFixer iteratively changes the reference genome based on 
the mapping of the current iteration. Also, we used InDelFixer 
in single iteration mode to map reads onto the constructed de 
novo consensus. De novo consensus is longer (4337 bp) than the 
reference-based consensus (3440 bp) and contained two regions 
with extremely peaked coverage compared with the surrounding 
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regions. Both regions were considered to be the result of technical 
artifacts and removed from further consideration. After removing 
both regions, the length of new de novo consensus becomes 
3452 bp. We also filtered reads that belonged to regions with ex- 
treme coverage. Finally we compared the number of reads 
mapped to the reference-based consensus versus the de novo con- 
sensus. A larger amount of reads mapped to the assembled con- 
sensus, thereby highlighting the advantage of de novo procedure 
for consensus construction over a reference-based. 

From the de novo consensus, VGA assembled 32 full-length 
viral genomes that differ from each other in 2145 SNVs. Among 
known HIV sequences, Gag/Pol is the closest to the de novo 
consensus. Each of the 32 full-length viral genomes do not con- 
tain stop codons inside two known coding regions of Gag/Pol of 
length 1520 and 1820 bp, respectively. Alternatively, when VGA 
is applied to all 15 M original uncorrected reads, 57 distinct viral 
genomes are assembled among which 36 contain stop codons in 
the two coding regions. This shows that a regular sequencing 
protocol is unsuitable for viral genome reconstruction. 

4 DISCUSSION 

We have presented VGA, an accurate method for viral popula- 
tion assembly from ultra-deep sequencing data. The proposed 
algorithm is coupled with a high-fidelity sequencing protocol 
able to eliminate errors from sequencing data. Deep coverage 
in combination with highly accurate data allows our method to 
accurately estimate the underlying diversity of a viral population. 
In particular, it makes possible to distinguish true biological mu- 
tations from sequencing errors, facilitating assembly of rare in- 
dividual genomes. Our method condenses the viral population 
into a conflict graph built from aligned reads. To distinguish 
between viral variants, the conflict graph is colored into a min- 
imal set of colors. Each color represents individual viral genomes 
composed from the set of non-conflicting reads. An expectation- 
maximization algorithm was used to estimate relative abundance 
frequencies of assembled viral genomes. 

To our knowledge, our method is the first viral assembly 
method that scales to millions of paired-end sequencing reads. 
Experiments on both real and synthetic HIV datasets generated 
with various sequencing parameters and distribution assump- 
tions suggest that VGA is able to assemble diverse viral popula- 
tion from millions of paired-end reads. The ability of our method 
to maintain 100% assembly accuracy makes it suitable for clin- 
ical applications. In addition, the constant increase of sequencing 
depth offered by high-throughput technologies provide us with 
unprecedented resolution promising to increase number of dis- 
covered ultra-rare viral variants in the population. 
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